Predicting COVID-19 and Respiratory Illness: Results of the 2022–2023 Armed Forces Health Surveillance Division Forecasting Challenge

Since 2019, the Integrated Biosurveillance Branch of the Armed Forces Health Surveillance Division has conducted an annual forecasting challenge during influenza season to predict short-term respiratory disease activity among Military Health System beneficiaries. Weekly observed case and encounter data were used to generate 1- through 4-week ahead forecasts of disease activity. To create unified combinations of model inputs for evaluation across multiple spatial resolutions, 8 individual models were used to calculate 3 ensemble models. The accuracy of each model's forecasts compared to observed activity was evaluated by calculating a weighted interval score. Weekly 1- through 4-week ahead forecasts for each ensemble model were generally higher than observed data, especially during periods of peak activity, with peaks in forecasted activity occurring later than observed peaks. The larger the forecasting horizon, the more pronounced the gap between forecasted peak and observed peak. The results showed that several models accurately predicted COVID-19 cases and respiratory encounters with enough lead time for public health response by senior leaders.

What are the new findings?
By testing a large number of traditional (e.g., ARIMA, EWMA) and non-traditional (e.g., Random Forest, Count Regression) models, this forecasting study improved understanding of which model types were the most accurate and demonstrated a more robust ensemble prediction. The ensemble models developed by the forecasting challenge generally provided more accurate forecasts than most individual models.

What is the impact on readiness and force health protection?
Respiratory diseases represent a major impediment to military readiness and force health, including interruptions in duties caused by isolation or quarantine requirements as well as morbidity caused by the illnesses themselves. Respiratory disease forecasting is a useful tool for senior leaders' preparations for illness surges.

Mark L. Bova, MPH; Sasha A. McGee, PhD; Kathleen R. Elliott, MPH; Juan I. Ubiera, MPH, MS

Seasonal respiratory infections, including influenza and COVID-19, represent a major impediment to military readiness. Accurate forecasts of the burden of respiratory illness in the Department of Defense (DOD) population are crucial for allowing military leaders and public health practitioners to anticipate increases in disease activity and implement preventive measures.
Since 2013, the U.S. Centers for Disease Control and Prevention (CDC) has conducted an annual influenza forecasting challenge, inviting modelers to submit weekly forecasts of influenza-like illness (ILI) or confirmed influenza hospitalizations. 1 To produce more consistent and reliable forecasts across varying spatial resolutions, forecasting challenges often combine inputs from multiple models into one unified ensemble. 2 Since 2019, the Integrated Biosurveillance (IB) Branch of the Armed Forces Health Surveillance Division (AFHSD), 3 part of the Defense Health Agency's Public Health Directorate, has conducted its own annual forecasting challenge during the influenza season, modeled after that of the CDC. The goal is to predict short-term (1-4 weeks ahead) respiratory disease activity among Military Health System (MHS) beneficiaries within collections of geographically aligned military installations and medical facilities in the U.S. ("markets") to support timely decision-making by senior leaders. In addition to forecasting disease activity among MHS beneficiaries, AFHSD also forecasts activity among civilians living in counties within 30 miles of a market. This challenge is open to forecasts submitted by government, academic, and industry partners.
During influenza season, AFHSD-IB reports forecast data through weekly biosurveillance products emailed to more than 3,000 individuals. Stakeholders can access these data as needed to inform resource allocation and prevention activities via an interactive dashboard (by Common Access Card only) updated weekly by AFHSD-IB. 4 This dashboard includes summary information about respiratory illness in each market and DHA network, as well as maps and time series plots of 1- through 4-week ahead forecasts.
This report summarizes the results and lessons from AFHSD's forecasts for the 2022-2023 forecasting season.

Methods
Influenza seasons were defined as epidemiological weeks 40 through 20 according to CDC's Morbidity and Mortality Weekly Report (MMWR) epidemiological weeks. 5 The 2022-2023 influenza season began on October 2, 2022 and ended May

Weekly observed case and encounter data were used to generate 1- through 4-week ahead forecasts of disease activity. Forecasts were generated using various models, including time series models (Autoregressive Integrated Moving Average [ARIMA]; Error, Trend, Seasonal [ETS]; Exponentially Weighted Moving Average [EWMA]; and Vector Autoregressive [VAR]), machine learning models (Random Forest), and count regression models (Poisson, Negative Binomial, and Log-binomial). To create unified combinations of model inputs for evaluation across multiple spatial resolutions, 8 individual models were used to calculate the 3 ensemble models: 1) the average of the time series and machine learning models (ENSEMBLE), 2) the average of the 3 best-performing time series and machine learning models (ENSEMBLE_TOP), and 3) the average of the count regression models (ENSEMBLE_CNT).
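As a rough illustration of how the 3 ensembles described above could be formed, the following Python sketch takes an unweighted average of point forecasts at each horizon. It is not the study's code (the analyses were conducted in R). All forecast values, the `past_wis` table, and the function names are hypothetical, and ranking the "best-performing" models by a previously computed WIS is an assumption, since the ranking rule is not stated here.

```python
from statistics import mean

def ensemble_mean(forecasts, members):
    """Average the point forecasts of the member models, horizon by horizon."""
    n_horizons = len(forecasts[members[0]])
    return [mean(forecasts[m][h] for m in members) for h in range(n_horizons)]

def top_models(past_wis, candidates, k=3):
    """Pick the k candidates with the lowest (best) score. Assumed ranking rule."""
    return sorted(candidates, key=lambda m: past_wis[m])[:k]

# Hypothetical 1- to 4-week ahead point forecasts for one market and target.
forecasts = {
    "arima": [10.0, 12.0, 14.0, 16.0],
    "ets": [11.0, 13.0, 15.0, 17.0],
    "ewma": [9.0, 9.0, 9.0, 9.0],
    "var": [12.0, 14.0, 16.0, 18.0],
    "random_forest": [8.0, 12.0, 11.0, 20.0],
    "poisson": [10.0, 11.0, 12.0, 13.0],
    "neg_binomial": [10.0, 12.0, 13.0, 15.0],
    "log_binomial": [13.0, 13.0, 14.0, 14.0],
}
TS_ML = ["arima", "ets", "ewma", "var", "random_forest"]
COUNT = ["poisson", "neg_binomial", "log_binomial"]

# Hypothetical historical scores used only to rank the models (lower is better).
past_wis = {"arima": 2.0, "ets": 2.5, "ewma": 4.0, "var": 3.0, "random_forest": 3.5}

ensemble = ensemble_mean(forecasts, TS_ML)
ensemble_top = ensemble_mean(forecasts, top_models(past_wis, TS_ML))
ensemble_cnt = ensemble_mean(forecasts, COUNT)

print(ensemble)      # [10.0, 12.0, 13.0, 16.0]
print(ensemble_top)  # [11.0, 13.0, 15.0, 17.0]
print(ensemble_cnt)  # [11.0, 12.0, 13.0, 14.0]
```

A simple mean treats every member equally, so a single erratic model (here the hypothetical `random_forest` at week 4) still shifts the ensemble; restricting the average to the best-ranked models is one way to limit that effect.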
The accuracy of forecasts compared to the observed activity for each model was evaluated by calculating a weighted interval score (WIS), 10 a metric also used by the CDC, that compares performance among models. A lower score indicates better model performance. All analyses were conducted using R software (version 4.1, The R Foundation for Statistical Computing, Vienna, Austria). [13][14]
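For readers unfamiliar with the WIS, the Python sketch below follows the commonly used definition: the absolute error of the predictive median plus a weighted sum of interval scores, where each central (1 - alpha) prediction interval is penalized for its width and for missing the observation. The interval levels and all numbers are illustrative assumptions; the quantile levels used in this study are not stated here, and the function names are hypothetical.

```python
def interval_score(y, lower, upper, alpha):
    """Interval score for a central (1 - alpha) prediction interval."""
    width = upper - lower
    below = (2.0 / alpha) * max(lower - y, 0.0)  # penalty if y falls below the interval
    above = (2.0 / alpha) * max(y - upper, 0.0)  # penalty if y falls above the interval
    return width + below + above

def weighted_interval_score(y, median, intervals):
    """WIS for one forecast.

    intervals maps alpha -> (lower, upper), e.g. 0.2 -> the 80% interval.
    """
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(y, lower, upper, alpha)
    return total / (len(intervals) + 0.5)

# Hypothetical forecast: median 10, 50% interval (8, 11), 80% interval (6, 14).
score = weighted_interval_score(
    y=12.0, median=10.0, intervals={0.5: (8.0, 11.0), 0.2: (6.0, 14.0)}
)
print(round(score, 2))  # 1.42
```

Because the score grows with both interval width and distance from the observation, a forecast that is confidently wrong and a forecast that is vague are both penalized, which is why a lower WIS indicates a better forecast.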

Results
Weekly observed counts of MHS and civilian COVID-19 cases by market were converted to population-adjusted rates, while weekly observed MHS outpatient encounters were converted to a percentage of total outpatient encounters for that week. Weekly 1- through 4-week ahead forecasts for each ensemble model were generally higher than observed data, especially during periods of peak activity (December through February), with peaks in forecasted activity occurring later than observed peaks (Figure 1). The larger the forecasting horizon (i.e., 4 weeks ahead versus 1 week), the more pronounced the gap between forecasted peak and observed peak.
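The two conversions described above amount to simple ratios, sketched below in Python. The per-100,000 scaling is an assumption (the text says only "population-adjusted"), and the function names and example numbers are hypothetical.

```python
def rate_per_100k(cases, population):
    """Weekly cases per 100,000 population (scaling factor is an assumption)."""
    return 100_000 * cases / population

def percent_of_encounters(syndrome_encounters, total_encounters):
    """Weekly syndrome encounters as a percentage of all outpatient encounters."""
    return 100 * syndrome_encounters / total_encounters

# Hypothetical market: 50 cases among 200,000 beneficiaries,
# and 30 ILI encounters out of 1,200 outpatient encounters in one week.
print(rate_per_100k(50, 200_000))       # 25.0
print(percent_of_encounters(30, 1200))  # 2.5
```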
Forecasts of peak MHS COVID-19 case rates were mostly higher than observed, ranging from 44% higher for the ENSEMBLE_CNT model to 457% higher for the ENSEMBLE_TOP model (Table 1). Peak civilian COVID-19 case rate forecasts were more accurate, ranging from 13% lower (ENSEMBLE_CNT) to 99% higher (ENSEMBLE). Peak encounter forecasts for the ENSEMBLE_CNT model were lower than observed peaks (16% and 9% lower for ILI and CLI, respectively) and equal to the observed peak for COVID-19 encounters. Peak encounter forecasts for the ENSEMBLE_TOP model were higher than observed peaks, including 24% higher for ILI, 27% higher for CLI, and 10% higher for COVID-19 encounters. Peak week forecasts tended to be 2 to 6 weeks later than observed for most ensemble models and forecast targets. The ENSEMBLE_CNT model, however, accurately forecasted the peak week for civilian COVID-19 cases and MHS ILI encounters.
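The peak comparisons above combine two quantities: how far the forecasted peak value differs from the observed peak, and how many weeks separate the two peaks. A minimal Python sketch of that calculation follows, assuming both series are aligned week by week; the function name and the example series are hypothetical and are not taken from Table 1.

```python
def peak_comparison(observed, forecast):
    """Return (percent difference in peak value, forecast peak lag in weeks).

    Both lists must cover the same weeks in the same order. If a series has
    tied maxima, the earliest week is treated as the peak.
    """
    obs_week = max(range(len(observed)), key=lambda i: observed[i])
    fc_week = max(range(len(forecast)), key=lambda i: forecast[i])
    pct_diff = 100 * (forecast[fc_week] - observed[obs_week]) / observed[obs_week]
    return pct_diff, fc_week - obs_week

# Hypothetical weekly rates: the forecast peaks higher and 2 weeks later.
observed = [2.0, 5.0, 10.0, 6.0, 3.0, 2.0]
forecast = [2.0, 3.0, 6.0, 9.0, 12.0, 7.0]
print(peak_comparison(observed, forecast))  # (20.0, 2)
```

A positive lag means the forecasted peak arrived after the observed peak, which is the pattern reported for most ensemble models and targets.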
Overall, the ENSEMBLE_CNT model had the lowest WIS across all forecasting horizons for civilian and MHS COVID-19 cases, indicating the most accurate forecasts (Figure 2). The ENSEMBLE_TOP model was the most accurate for COVID-19 encounter forecasts, while all 3 ensemble models performed similarly for CLI and ILI encounters. Model performance decreased as forecast horizons increased, with the median WIS for all 4-week ahead forecasts of the ensemble models increasing between 10% (MHS ILI encounters) and 98% (civilian COVID-19 cases) compared to 1-week ahead forecasts.

Discussion
[...] performed similarly when compared to the best-performing ensemble model. Model performance decreased as the forecasting horizon increased, with WIS values ranging from 10% to 95% higher on average for 4-week ahead forecasts compared to 1-week ahead forecasts. These results are consistent with a previous publication of COVID-19 forecasts in the U.S. COVID-19 Forecast Hub that found that an ensemble model composed of 27 individual models was consistently more accurate than the individual models, and that the accuracy of forecasting models decreased as forecast horizons increased. 15
This forecasting study has several strengths. First, the forecasting results showed that several models accurately predicted COVID-19 cases and respiratory encounters with enough lead time for senior leaders to take action. Second, this forecasting study tested a large number of traditional (e.g., ARIMA, EWMA) and non-traditional (e.g., Random Forest, Count Regression) models, increasing our understanding of which types of models were most accurate and providing a more robust ensemble prediction.
The forecasting of the 2022-2023 season also showed several limitations that may have affected model accuracy. COVID-19 cases may have been generally under-reported due to the large number [...] 16 Many states and military treatment facilities also changed their COVID-19 case reporting schedules, from daily to weekly, monthly, or not at all. To bridge some of the gaps in COVID-19 reporting, health encounter data from DOD ESSENCE could be utilized, but syndromic surveillance systems such as ESSENCE may suffer from inconsistent data quality between reporting sites and gaps in coverage. 17 In addition, these data can lag by at least 4 days from the encounter date, leading to under-reporting of health encounters during the most recent week. This lag presents a challenge for forecasting, as the observed value for that week may change significantly in subsequent weeks. During the 2022-2023 season, reported numbers of civilian and MHS COVID-19 cases for a given week increased by as much as 50% 1 month after an initial reporting date, as older cases were reported, while changes in MHS encounter data ranged from a 40% decrease to a 40% increase as additional encounters populated the system. Efforts were made to account for potential backfill in each market for both case and encounter data prior to generating weekly forecasts, but forecasting analysis can be challenging due to unpredictable data processing schedules.
Other limitations included the availability and usefulness of covariate data. Data previously relied on for COVID-19 forecasting, including vaccination and case data, became less reliable or unavailable during the season.
Another limitation of this study is the relative usefulness and timeliness of the forecasts. As mentioned, forecast accuracy decreased as forecasting horizon increased. The data lags in ESSENCE, compounded by the time constraints of downloading and aggregating weekly data and generating weekly forecasts, meant that weekly forecasts were not available for senior leaders until nearly 1 week after the most recently observed data. This circumstance renders the 1-week ahead forecasts of disease activity mostly unusable, limiting senior leaders' response time to 2-week ahead forecasts. Although the 3- and 4-week ahead forecasts provide adequate time for senior leaders to make necessary preparations, their accuracy is greatly diminished compared to 1- and 2-week ahead forecasts. The utility of 1- and 2-week ahead forecasts may be improved by downloading data earlier each week and generating weekly forecasts more efficiently, but improving the more distant horizon forecasts and expanding beyond 4 weeks are current priorities.

Figure 1. Weekly Forecasts Versus Observed Data by Ensemble Models and Forecasting Horizon, All U.S. Surveillance

Table 1. Comparison of Observed and Forecasted Activity by Forecast Target, All U.S. Surveillance Markets, 2-Week Forecasting
The median WIS for that target, horizon, and model divided by the median WIS for that target and horizon, intended to show how well a model performed compared to the average, with green above average and red below average.

Detection of SARS-CoV-2 nucleic acid (RNA) by molecular amplification from a clinical or autopsy specimen
Probable: Meets any of the following criteria
1) Epidemiologically linked to another case of COVID-19 with no confirmatory COVID-19 laboratory testing and meets the following clinical description of a case:
a. At least TWO of the following symptoms: fever, chills, rigors, myalgia, headache, sore throat, nausea or vomiting, diarrhea, fatigue, congestion or runny nose OR
b. Any ONE of the following symptoms: cough, shortness of breath, difficulty breathing, new olfactory disorder, or new taste disorder OR
c.